In [1]:

    
import pandas as pd



In [2]:

    
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Review of NumPy arrays

NumPy arrays are fixed-size containers for homogeneous data: every element has the same type. For example, an array of integers:



In [3]:

    
def print_array(a):
    print("{} elements of type {}: {}".format(len(a), a.dtype.name, a))
    
a = np.array([1,2,3])
print_array(a)









    



3 elements of type int64: [1 2 3]

or an array of real numbers:



In [4]:

    
a = np.array([1.5, 2.5, 3.5])
print_array(a)









    



3 elements of type float64: [ 1.5  2.5  3.5]

or an array of strings:



In [5]:

    
a = np.array(['cześć', 'software', 'carpentry'])
print_array(a)









    



3 elements of type str288: ['cześć' 'software' 'carpentry']
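Homogeneity has a practical consequence that a short sketch may make clearer: when values of different types are mixed, NumPy picks one common type for the whole array. The variable names `mixed` and `ints` are illustrative only.

```python
import numpy as np

# Mixing integers and a float: every element is stored as float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype.name)   # float64

# An integer array keeps its type, so an assigned float is truncated.
ints = np.array([1, 2, 3])
ints[0] = 7.9
print(ints)               # [7 2 3]
```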

Accessing (indexing) elements

Single elements are retrieved by their integer index, counting from 0:



In [6]:

    
a = np.array([101, 102, 103, 104, 105])
print(a[1])

A sub-array of consecutive elements can be retrieved with a slice:



In [7]:

    
print(a[1:3])
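As a small sketch of the indexing rules above, the following repeats the two cells with their results as comments and adds two further standard NumPy forms (negative indices and a step), which go slightly beyond what the text covers.

```python
import numpy as np

a = np.array([101, 102, 103, 104, 105])

print(a[1])     # 102: the second element, because counting starts at 0
print(a[1:3])   # [102 103]: positions 1 and 2, the stop index 3 is excluded
print(a[-1])    # 105: negative indices count from the end
print(a[::2])   # [101 103 105]: every second element
```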

Two-dimensional arrays

2D arrays in NumPy are like matrices: they have rows and columns. To retrieve an element from the array, we pass two indices or slices, the first for the row and the second for the column.



In [8]:

    
a = np.arange(12).reshape(3, 4)
print(a)









    



[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]



In [9]:

    
print(a[1, 2])



In [10]:

    
print(a[1:, 2:])









    



[[ 6  7]
 [10 11]]
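A bare `:` selects everything along one axis, which is one way to pull out a whole row or column. The sketch below uses the same array; the names `second_row` and `third_column` are chosen here for illustration.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

second_row = a[1, :]      # all columns of row 1
third_column = a[:, 2]    # all rows of column 2

print(a.shape)            # (3, 4): 3 rows, 4 columns
print(second_row)         # [4 5 6 7]
print(third_column)       # [ 2  6 10]
```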

EXERCISE: Add the first and third row of the array a

Pandas data structures

Pandas defines two fundamental object types, both built upon NumPy arrays: the Series object, and the DataFrame object.

Series

A Series is a basic holder for one-dimensional labeled data. It can be created like a NumPy array:



In [11]:

    
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s









    Out[11]:





0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

Attributes of a Series: `index` and `values`

A Series has a built-in index, which by default consists of the integers 0 through N - 1:



In [12]:

    
s.index









    Out[12]:





Int64Index([0, 1, 2, 3], dtype='int64')

You can access the underlying NumPy array representation with the .values attribute:



In [13]:

    
s.values









    Out[13]:





array([ 0.1,  0.2,  0.3,  0.4])

We can access series values via the index, just like for NumPy arrays:



In [14]:

    
s[0]









    Out[14]:





0.10000000000000001

Unlike the NumPy array, though, this index can be something other than integers:



In [15]:

    
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2









    Out[15]:





a    0
b    1
c    2
d    3
dtype: int64



In [16]:

    
s2['c']









    Out[16]:





2
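Since a Series can be addressed both by label and by position, it may help to make the choice explicit. The sketch below uses the `.loc` (label) and `.iloc` (position) accessors, which are part of pandas but are not introduced in the text above.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = s2.loc['c']      # look up the element labelled 'c'
by_position = s2.iloc[2]    # look up the third element, whatever its label

print(by_label, by_position)   # 2 2
```

With the default integer index the two coincide, which is why `s[0]` above looks like NumPy indexing even though it is a label lookup.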

It's possible to construct a series directly from a Python dictionary. Let's first define the dictionary.



In [17]:

    
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
pop_dict['Germany']









    Out[17]:





81.3

Trying to access non-existing keys in a dictionary will produce an error:



In [18]:

    
# pop_dict['Poland']

But we can add new keys easily:



In [19]:

    
pop_dict['Poland'] = 40
pop_dict









    Out[19]:





{'Belgium': 11.3,
 'France': 64.3,
 'Germany': 81.3,
 'Netherlands': 16.9,
 'Poland': 40,
 'United Kingdom': 64.9}

NumPy-style arithmetic operations won't work on a plain dictionary:



In [20]:

    
#pop_dict * 1000

Now we construct a Series object from the dictionary.



In [21]:

    
population = pd.Series(pop_dict)
population









    Out[21]:





Belgium           11.3
France            64.3
Germany           81.3
Netherlands       16.9
Poland            40.0
United Kingdom    64.9
dtype: float64

We can index the populations like a dict as expected:



In [22]:

    
population['France']









    Out[22]:





64.299999999999997

but with the power of NumPy arrays, such as element-wise arithmetic:



In [23]:

    
population * 1000









    Out[23]:





Belgium           11300
France            64300
Germany           81300
Netherlands       16900
Poland            40000
United Kingdom    64900
dtype: float64

Many things we have seen for NumPy can also be used with pandas objects.

Slicing:



In [24]:

    
population['Belgium':'Germany']









    Out[24]:





Belgium    11.3
France     64.3
Germany    81.3
dtype: float64
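Label slices differ from positional slices in one respect that the output above hints at: the end label is included. The sketch below contrasts the two. It calls `sort_index()` explicitly, on the assumption that recent pandas versions keep the dictionary's insertion order instead of sorting the keys as in the output shown here.

```python
import pandas as pd

pop_dict = {'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
            'United Kingdom': 64.9, 'Netherlands': 16.9, 'Poland': 40}

# Sort by label so that a label slice covers a contiguous, predictable range.
population = pd.Series(pop_dict).sort_index()

by_label = population.loc['Belgium':'Germany']   # end label is included
by_position = population.iloc[0:2]               # end position is excluded

print(list(by_label.index))      # ['Belgium', 'France', 'Germany']
print(list(by_position.index))   # ['Belgium', 'France']
```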

A range of methods:



In [25]:

    
population.mean()









    Out[25]:





46.449999999999996
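A few more of these methods, as a rough sketch on the same data. `idxmax` and boolean-mask selection are standard pandas features not shown in the text above; the name `large` is illustrative.

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9,
                        'Poland': 40})

print(population.min())      # 11.3
print(population.max())      # 81.3
print(population.idxmax())   # Germany: the label of the largest value

# A boolean mask keeps only the entries where the condition holds.
large = population[population > 50]
print(sorted(large.index))   # ['France', 'Germany', 'United Kingdom']
```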

EXERCISE: Calculate how big the population of each country is relative to France.



In [ ]:

EXERCISE: Define the following Series containing prices of beverages:


Beer              5
Coffee            2.5
Orange Juice      5
Water             2
Wine              6



In [ ]:

DataFrames: Multi-dimensional Data

A DataFrame is a tabular data structure (a multi-dimensional object that holds labeled data) composed of rows and columns, akin to a spreadsheet, a database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.

One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.

Note that in the IPython notebook, the data frame will display in a rich HTML view:



In [28]:

    
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries









    Out[28]:






  
    
      
     area    capital         country  population
0   30510   Brussels         Belgium        11.3
1  671308      Paris          France        64.3
2  357050     Berlin         Germany        81.3
3   41526  Amsterdam     Netherlands        16.9
4  244820     London  United Kingdom        64.9
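The view of a DataFrame as several Series sharing one index can be sketched directly by building a frame from a dictionary of Series. The small `population` and `area` Series here are illustrative and deliberately list the countries in different orders.

```python
import pandas as pd

population = pd.Series({'Belgium': 11.3, 'France': 64.3, 'Germany': 81.3})
area = pd.Series({'Germany': 357050, 'Belgium': 30510, 'France': 671308})

# Each Series becomes a column; rows are matched by label, not by position.
frame = pd.DataFrame({'population': population, 'area': area})
print(frame)

print(frame.loc['Germany', 'area'])   # 357050
```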

Attributes of the DataFrame

Besides an index attribute, a DataFrame also has a columns attribute:



In [29]:

    
countries.index









    Out[29]:





Int64Index([0, 1, 2, 3, 4], dtype='int64')



In [30]:

    
countries.columns









    Out[30]:





Index(['area', 'capital', 'country', 'population'], dtype='object')

To check the data types of the different columns:



In [31]:

    
countries.dtypes









    Out[31]:





area            int64
capital        object
country        object
population    float64
dtype: object

An overview of that information can be given with the info() method:



In [32]:

    
countries.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
area          5 non-null int64
capital       5 non-null object
country       5 non-null object
population    5 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 200.0+ bytes

A DataFrame also has a values attribute, which returns its NumPy representation:



In [33]:

    
countries.values









    Out[33]:





array([[30510, 'Brussels', 'Belgium', 11.3],
       [671308, 'Paris', 'France', 64.3],
       [357050, 'Berlin', 'Germany', 81.3],
       [41526, 'Amsterdam', 'Netherlands', 16.9],
       [244820, 'London', 'United Kingdom', 64.9]], dtype=object)
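The `dtype=object` in this output follows from NumPy's homogeneity: a single array has to hold numbers and strings at once. As a sketch, selecting only the numeric columns first gives a numeric array instead; the name `numeric` is illustrative.

```python
import pandas as pd

data = {'country': ['Belgium', 'France', 'Germany'],
        'population': [11.3, 64.3, 81.3],
        'area': [30510, 671308, 357050],
        'capital': ['Brussels', 'Paris', 'Berlin']}
countries = pd.DataFrame(data)

# Mixed string and numeric columns fall back to the generic object type.
print(countries.values.dtype)   # object

# Integer and float columns together are stored as float64.
numeric = countries[['area', 'population']].values
print(numeric.dtype)            # float64
```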

If we don't like what the index looks like, we can replace it with one of our columns using set_index:



In [34]:

    
countries = countries.set_index('country')
countries









    Out[34]:






  
    
      
                  area    capital  population
country
Belgium          30510   Brussels        11.3
France          671308      Paris        64.3
Germany         357050     Berlin        81.3
Netherlands      41526  Amsterdam        16.9
United Kingdom  244820     London        64.9
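`set_index` returns a new DataFrame and leaves the original untouched, which is why the cell above assigns the result back to `countries`. The following sketch also shows `reset_index`, the pandas method that moves the index back into an ordinary column; the names `indexed` and `restored` are illustrative.

```python
import pandas as pd

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany'],
                          'population': [11.3, 64.3, 81.3]})

indexed = countries.set_index('country')   # 'country' becomes the index
restored = indexed.reset_index()           # and turns back into a column

print(list(indexed.index))      # ['Belgium', 'France', 'Germany']
print(list(indexed.columns))    # ['population']
print(list(restored.columns))   # ['country', 'population']
```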

To access a Series representing a column in the data, use typical indexing syntax:



In [35]:

    
countries['area']









    Out[35]:





country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64
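Square brackets select columns; rows are selected by label with the `.loc` accessor, which is part of pandas but not covered in the text above. A minimal sketch on a reduced version of the table:

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050],
     'capital': ['Brussels', 'Paris', 'Berlin'],
     'population': [11.3, 64.3, 81.3]},
    index=pd.Index(['Belgium', 'France', 'Germany'], name='country'))

france = countries.loc['France']                # one row, returned as a Series
france_area = countries.loc['France', 'area']   # one cell: row label, column label

print(france['capital'])   # Paris
print(france_area)         # 671308
```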

As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.

Arithmetic is one example. Let's compute the population density of each country:



In [36]:

    
countries['population']*1000000 / countries['area']









    Out[36]:





country
Belgium           370.370370
France             95.783158
Germany           227.699202
Netherlands       406.973944
United Kingdom    265.092721
dtype: float64
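One difference from NumPy is worth a short sketch: arithmetic between two Series matches elements by index label, not by position. That is why the division above pairs each country's population with its own area. The Series `first` and `second` are made up for illustration.

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
second = pd.Series([30, 10, 20], index=['c', 'a', 'b'])

# Elements are paired by label: 'a' with 'a', 'b' with 'b', 'c' with 'c'.
total = first + second

print(total.loc['a'], total.loc['b'], total.loc['c'])   # 11 22 33
```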

Adding a new column to the dataframe is very simple:



In [37]:

    
countries['density'] = countries['population']*1000000 / countries['area']
countries









    Out[37]:






  
    
      
                  area    capital  population     density
country
Belgium          30510   Brussels        11.3  370.370370
France          671308      Paris        64.3   95.783158
Germany         357050     Berlin        81.3  227.699202
Netherlands      41526  Amsterdam        16.9  406.973944
United Kingdom  244820     London        64.9  265.092721

We can also sort the rows by the values of a column, here by density in descending order:



In [38]:

    
countries.sort_values(by='density', ascending=False)









    Out[38]:






  
    
      
                  area    capital  population     density
country
Netherlands      41526  Amsterdam        16.9  406.973944
Belgium          30510   Brussels        11.3  370.370370
United Kingdom  244820     London        64.9  265.092721
Germany         357050     Berlin        81.3  227.699202
France          671308      Paris        64.3   95.783158
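To keep only the first rows of a sorted result, the sort can be combined with `head`. This is a sketch that rebuilds a trimmed version of the table so it runs on its own; `top_two` is an illustrative name.

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050, 41526, 244820],
     'population': [11.3, 64.3, 81.3, 16.9, 64.9]},
    index=pd.Index(['Belgium', 'France', 'Germany', 'Netherlands',
                    'United Kingdom'], name='country'))
countries['density'] = countries['population'] * 1000000 / countries['area']

# Sort from densest to least dense, then keep the first two rows.
top_two = countries.sort_values(by='density', ascending=False).head(2)

print(list(top_two.index))   # ['Netherlands', 'Belgium']
```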

One useful method is describe, which computes summary statistics for each numeric column:



In [39]:

    
countries.describe()









    Out[39]:






  
    
      
                area  population     density
count       5.000000    5.000000    5.000000
mean   269042.800000   47.740000  273.183879
std    264012.827994   31.519645  123.440607
min     30510.000000   11.300000   95.783158
25%     41526.000000   16.900000  227.699202
50%    244820.000000   64.300000  265.092721
75%    357050.000000   64.900000  370.370370
max    671308.000000   81.300000  406.973944

The plot method can be used to quickly visualize the data in different ways:



In [40]:

    
countries.plot()









    Out[40]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f65f9636a20>

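However, for this dataset, a line plot of all columns does not say much, because the columns have very different scales. A bar plot of a single column is more informative: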



In [41]:

    
countries['population'].plot(kind='bar')









    Out[41]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f65f7d3ecf8>

You can play with the kind keyword: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin'

EXERCISE: Define a `DataFrame` with two columns: price and volume of each beverage. Use the beverage name as the index. Add a new column ``price per litre``, sort the data frame according to the values of that column, and plot the values as a bar plot.



In [ ]:



In [ ]:

Importing and exporting data

A wide range of input/output formats are natively supported by pandas:

CSV, text
SQL database
Excel
HDF5
json
html
pickle
...



In [44]:

    
pd.read_csv









    Out[44]:





<function pandas.io.parsers._make_parser_function.<locals>.parser_f>



In [45]:

    
countries.to_csv









    Out[45]:





<bound method DataFrame.to_csv of                   area    capital  population     density
country                                                  
Belgium          30510   Brussels        11.3  370.370370
France          671308      Paris        64.3   95.783158
Germany         357050     Berlin        81.3  227.699202
Netherlands      41526  Amsterdam        16.9  406.973944
United Kingdom  244820     London        64.9  265.092721>
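The two cells above only display the function objects. As a minimal sketch of how they might be called, the following writes a small table to CSV and reads it back. It uses an in-memory `io.StringIO` buffer instead of a file so that nothing is written to disk; a file name such as a hypothetical `countries.csv` could be passed in its place.

```python
import io

import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050],
     'capital': ['Brussels', 'Paris', 'Berlin']},
    index=pd.Index(['Belgium', 'France', 'Germany'], name='country'))

buffer = io.StringIO()
countries.to_csv(buffer)   # the index is written as the first column
buffer.seek(0)             # rewind before reading the text back

# index_col tells read_csv which column to turn back into the index.
restored = pd.read_csv(buffer, index_col='country')
print(restored)
```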

Acknowledgement

© 2015, Stijn Van Hoey and Joris Van den Bossche (stijnvanhoey@gmail.com, jorisvandenbossche@gmail.com).

© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)

Licensed under CC BY 4.0 Creative Commons

This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).


